February 11, 2025
\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]
A single layer, fully connected neural network (FCNN) can be succinctly defined in two equations:
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y_i)\]
\[\theta_i = g(\bb^T \varphi(\mathbf W^T \mathbf x_i + \mathbf c) + b)\]
\(K\) hidden units
\(\bb\) is a \(K\)-vector of output layer weights
\(b\) is an output layer bias term
\(\varphi()\) is an activation function (like ReLU) that induces nonlinearities elementwise
\(\mathbf W\) is a \(P \times K\) matrix of weights. Each column of \(\mathbf W\) corresponds to a single hidden unit.
\(\mathbf c\) is a \(K\)-vector of bias terms.
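The forward pass defined by the two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the MNIST model: the dimensions and the randomly initialized parameters are placeholders, with \(\varphi\) = ReLU and \(g\) = sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: P input features, K hidden units
P, K = 5, 3

# Randomly initialized parameters, just so the sketch runs
W = rng.normal(size=(P, K))   # hidden layer weights; one column per hidden unit
c = rng.normal(size=K)        # hidden layer biases
beta = rng.normal(size=K)     # output layer weights
b = 0.1                       # output layer bias

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """theta_i = g(beta^T phi(W^T x_i + c) + b), with phi = ReLU and g = sigmoid."""
    h = relu(W.T @ x + c)          # K-vector of hidden features
    return sigmoid(beta @ h + b)   # scalar output in (0, 1)

x = rng.normal(size=P)
theta = forward(x)
```

Every piece of the network is just a linear map followed by an elementwise nonlinearity.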
With large enough \(K\), a single layer FCNN is a universal approximator
Compared to how the effective complexity of other methods scales:
KNN scales roughly as \(\frac{N}{K}\)
Polynomial expansions scale as \(P^d\)
Kernel methods scale as \(N\)
Trees scale as \(N\)
Note that the difference here is that \(N\) doesn’t play much of a role in determining how complex the NN approximator is!
When \(N\) is really large, NNs are really state of the art for classification/regression
It’s just a bunch of regressions
Arranged in a really clever way
And they tend to scale quite nicely to big data!
Another attractive feature of neural networks is explicit feature extraction
Recall that a FCNN can be thought of as a graph:
We map the original feature space to the hidden feature space
\[\mathbf h_i = \varphi(\mathbf W^T \mathbf x_i + \mathbf c)\]
A reasonable question is whether or not the \(K\) dimensional space actually corresponds to anything meaningful
To examine this, we’re going to look at the 3/8 problem using the MNIST handwritten digits data set.
14780 instances of handwritten 3s and 8s
784 pixel values in grayscale that correspond to one of the 28 by 28 pixel locations in the image
Small values correspond to background pixels and large values correspond to pencil pixels
Goal:
Build a classifier over the flattened pixels (e.g. ignore locational context) that separates 3s from 8s
from sklearn.neural_network import MLPClassifier
import numpy as np
# X and y are assumed to already hold the flattened images and their labels
y = y.reshape(-1)
# Create a 1-layer neural network with 4 hidden features, a ReLU activation function, and the Adam optimizer
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='relu', solver='adam', random_state = 123)
# Train the neural network on X and y
clf.fit(X, y)
# Compute the accuracy of the classifier
accuracy = clf.score(X, y)
# Get the predicted labels for X
y_pred = clf.predict(X)
# Find the indices of the misclassified examples
misclassified = np.where(y != y_pred)
# Get the misclassified examples
X_misclassified = X[misclassified]
y_misclassified = y[misclassified]
import math
import matplotlib.pyplot as plt
# Determine the number of misclassified examples
num_misclassified = X_misclassified.shape[0]
# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(math.sqrt(num_misclassified))
num_cols = num_rows
# Create a new figure
plt.figure()
# For each misclassified example
for i in range(num_misclassified):
# Reshape the row into a 28x28 array
image = X_misclassified[i].reshape(28, 28)
# Add a subplot with the image
plt.subplot(num_rows, num_cols, i+1)
plt.imshow(image, cmap='gray')
plt.show()
Roughly a 0.03% error rate
With only 4 hidden units, it takes sklearn about a second to train
784 features
14k observations!
Super scalable!
The more interesting part - what do the hidden units look like?
Specifically, what mapping does the NN do in each hidden unit to put pixels into the hidden space?
In this case, all of our feature values are positive
For a single hidden unit:
\[h_{ik} = \varphi(\mathbf w_k^T \mathbf x_i + c_k)\]
Each weight for the hidden unit corresponds to a pixel
Positive weights multiplied by a positive value indicate that it is trying to activate that pixel in the hidden representation
Negative weights multiplied by a positive value indicate that it is trying to not activate that pixel in the hidden representation (e.g. push the inner product below zero so the ReLU just returns 0)
Weights that are essentially zero imply that the pixel doesn’t have any positive or negative relationship with the hidden feature that it is uncovering
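In sklearn, the first-layer weight matrix is exposed as `clf.coefs_[0]`, so each hidden unit's weights can be reshaped back into a 28 by 28 image and plotted. The sketch below trains a stand-in model on random data so it runs on its own; on the real 3/8 data you would reuse the `clf` fit earlier.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# Stand-in data so the sketch is self-contained; in the lecture example,
# X is the 14780 x 784 matrix of flattened 3/8 images
rng = np.random.default_rng(123)
X = rng.uniform(size=(200, 784))
y = rng.integers(0, 2, size=200)

clf = MLPClassifier(hidden_layer_sizes=(4,), activation='relu',
                    solver='adam', max_iter=50, random_state=123)
clf.fit(X, y)

# clf.coefs_[0] is the P x K first-layer weight matrix W; each column holds
# one hidden unit's pixel weights, viewable as a 28 x 28 image
W = clf.coefs_[0]
fig, axes = plt.subplots(1, W.shape[1])
for k, ax in enumerate(axes):
    ax.imshow(W[:, k].reshape(28, 28), cmap='gray')
    ax.set_title(f'hidden unit {k}')
    ax.axis('off')
fig.savefig('hidden_units.png')
```

Bright pixels in these images are ones the unit tries to activate on; dark pixels push the pre-activation toward zero.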
The hidden layers are related to things that make 3s and 8s 3s and 8s
The empty space on the left hand side of a 3
The double circle structure of an 8
In a way, each hidden unit is attempting to create a mapping from the feature space to a latent space, where the latent space represents a major source of variation within the original feature space
Over all instances, let \(\mathbf Z\) be the collection of hidden values for each observation in each hidden feature (an \(N \times K\) matrix)
By definition:
\[\mathbf Z = \varphi \left( \mathbf X \mathbf W \right)\]
where \(\mathbf W\) is a \(P \times K\) matrix (bias terms omitted for clarity)
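A quick NumPy check that the matrix form above agrees with the per-observation form \(\mathbf z_i = \varphi(\mathbf W^T \mathbf x_i)\) (biases omitted to match the equation; the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 6, 4, 3
X = rng.normal(size=(N, P))
W = rng.normal(size=(P, K))

relu = lambda z: np.maximum(z, 0.0)

# Matrix form: all N hidden representations at once
Z = relu(X @ W)

# Per-observation form: z_i = relu(W^T x_i), stacked row by row
Z_rows = np.stack([relu(W.T @ X[i]) for i in range(N)])

print(np.allclose(Z, Z_rows))  # the two forms agree
```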
This looks quite familiar…
\(\mathbf Z\) represents a latent space
Any thoughts?
This is a nonlinear mapping of the original feature space to a latent space
Keep this in the back of your mind!
Each hidden unit is creating a latent variable that relates to some aspect of the pixels that make it the digit.
The issue is that it’s trying to do everything at once!
After creating the hidden space, the goal is then to try to find a linear separator in the hidden space that separates the classes!
Let’s just look at the first three hidden units since this covers most of the variation in the data set.
The ReLU activation function just ensures that the approach appropriately emphasizes features rather than trying to create a direct PCA space
What makes a 3 a 3? What makes an 8 an 8? What makes a 3 different from an 8?
In reality, there’s a sort of hierarchical logic that helps us distinguish between digits
If it has a curvy top, a curvy bottom, and no pencil marks on the left, it’s a 3
If it has curves on the top and bottom and two sets of curves on the left and right, then it’s an 8.
A two layer, fully connected neural network (FCNN) can be succinctly defined in two equations:
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn \mathcal L(\theta_i | \mathbf x_i , y)\]
\[\theta_i = g(\bb^T \varphi(\mathbf W_2^T \varphi(\mathbf W_1^T \mathbf x_i + \mathbf c_1) + \mathbf c_2) + b)\]
\(K_1\) hidden units in the first layer
\(K_2\) hidden units in the second layer
Two sets of hidden weight matrices: \(\mathbf W_1\) (\(P \times K_1\)) and \(\mathbf W_2\) (\(K_1 \times K_2\))
Why depth rather than width?
Compared to single layer NNs with an equivalent number of weight vectors, DNNs:
Achieve higher accuracy with fewer hidden units
Are computationally faster than the equivalent single layer model
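A small parameter-counting sketch makes the comparison concrete. The helper function and the layer sizes below are illustrative (with \(P = 784\) inputs, as in MNIST), not taken from a specific benchmark:

```python
def fcnn_param_count(p, hidden, outputs=1):
    """Total weights + biases for a fully connected network."""
    total = 0
    sizes = [p] + list(hidden) + [outputs]
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

wide = fcnn_param_count(784, [8])      # one wide hidden layer
deep = fcnn_param_count(784, [4, 3])   # two narrower hidden layers
print(wide, deep)  # the deep network uses fewer parameters
```

Most of the parameter budget sits in the first layer, so stacking narrow layers after it is cheap.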
How are we able to achieve higher accuracy with a DNN?
It really comes down to how it constructs the hidden units!
Let’s go back to the MNIST data.
from sklearn.neural_network import MLPClassifier
import numpy as np
y = y.reshape(-1)
# Create a 2-layer neural network with 4 and 3 hidden units, a ReLU activation function, and the Adam optimizer
clf = MLPClassifier(hidden_layer_sizes=(4,3), activation='relu', solver='adam', random_state = 123)
# Train the neural network on X and y
clf.fit(X, y)
# Compute the accuracy of the classifier
accuracy = clf.score(X, y)
# Get the predicted labels for X
y_pred = clf.predict(X)
# Find the indices of the misclassified examples
misclassified = np.where(y != y_pred)
# Get the misclassified examples
X_misclassified = X[misclassified]
y_misclassified = y[misclassified]
import math
import matplotlib.pyplot as plt
# Determine the number of misclassified examples
num_misclassified = X_misclassified.shape[0]
# Calculate the number of rows and columns for the subplots
num_rows = math.ceil(math.sqrt(num_misclassified))
num_cols = num_rows
# Create a new figure
plt.figure()
# For each misclassified example
for i in range(num_misclassified):
# Reshape the row into a 28x28 array
image = X_misclassified[i].reshape(28, 28)
# Add a subplot with the image
plt.subplot(num_rows, num_cols, i+1)
plt.imshow(image, cmap='gray')
plt.show()
Note that this does a little bit worse on the training data!
A 0.45% error rate
Any thoughts as to why?
Ultimately, will make some nice pictures for understanding!
The first layer does okay at separating the 3s and 8s
The goal of the first layer is not to classify!
The goal of the first layer is to arrange things in a way that can then be rearranged into optimal classification form.
A hierarchical approach!
With a two layer network:
First layer breaks the images up into a set of features (curves, thickness of lines, locations, etc.)
Second layer puts the new features into a classification formation
Output layer can draw a linear boundary!
DNNs work so well because they treat the feature extraction problem in a hierarchical way!
For classification, this is how a human would decide whether something is a 0 or a 1
With enough hidden units in the first layer, we get very granular features that lead to easy decision making for the linear classifier
We're not seeing the full benefit yet because this kind of feature structure isn't that common in tabular data!
Why stop at two layers?
In theory, we could let our network be as deep as we want
The general expectation:
The first layer picks up a very granular set of features from the original feature space
The next layer aggregates those features into a different set
The next layer aggregates again
On and on and on
This hierarchical feature extraction approach is the dominant one for images and text!
There’s also a nifty Bayesian interpretation of deep models over wide models
When we do machine learning, we’re setting prior beliefs about the kind of function that we should learn
A deep model is a very general belief that the overall function is a composition of a collection of simpler linear functions
What is a dog if not just a collection of lines arranged a specific way?
Each layer decreases the granularity of those simple linear approximations
A bunch of simple functions together estimate really complicated ones!
This is the basis of deep learning
Are DNNs universal approximators?
NN Universal Approximation theorem:
Suppose we have a ReLU network with \(P\) inputs, \(D\) layers, and \(K\) units per layer.
The number of linear regions carved out by this ReLU network is:
\[\mathcal O \left( {K \choose P}^{P(D - 1)} K^P \right)\]
With a depth one network, the number of regions increases as \(K^P\)
With more layers, we’re amping up the number of regions
The parameter cost is only linear in depth: roughly \(D \times K\) weight vectors, each of length \(P\) (or \(K\))
DNNs can create really complex decision surfaces with not that many vectors of coefficients!
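To see how quickly depth pays off, we can plug small illustrative values into the bound above. This sketch just evaluates \({K \choose P}^{P(D-1)} K^P\); the choices of \(P\), \(K\), and \(D\) are arbitrary:

```python
from math import comb

def relu_regions_bound(p, k, d):
    """Order-of-magnitude bound on linear regions: C(K, P)^(P(D-1)) * K^P."""
    return comb(k, p) ** (p * (d - 1)) * k ** p

p, k = 2, 8
for d in (1, 2, 3):
    print(d, relu_regions_bound(p, k, d))  # regions explode as depth grows
```

Each added layer multiplies the bound by \({K \choose P}^P\), while only adding \(K\) more weight vectors.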
A first thought: Why not just fit skinny 1000 layer neural networks all the time?
Next week, we’re going to show that this isn’t computationally feasible due to the SGD training procedure
In theory, DNNs will always return what we want and interpolate the training data with a reasonable number of hidden units
Training NNs isn’t all roses, though.
Problem 1:
The NN objective function isn’t globally convex!
This is generically true but hard to prove.
So, let’s show this via a simple example.
Let’s consider a simple problem: 2 observations with one feature.
\[\mathbf X^T = [1,-1] \text{ ; } \mathbf y^T = [1,-1]\]
Our loss is MSE:
\[\mathcal L(\boldsymbol \theta | \mathbf X , \mathbf y) = \expn (y_i - \hat{y}_i)^2\]
Use a one layer NN with 2 hidden units and a ReLU activation:
\[\hat{y}_i = \bb^T \mathbf z_i + b\]
\[\mathbf z_i = \varphi(\mathbf w x_i + \mathbf c)\]
\[\mathbf X^T = [1,-1] \text{ ; } \mathbf y^T = [1,-1]\]
Let’s set all of the bias terms to zero. Consider two different sets of coefficients:
\[\mathbf w^T = [-1,1] \text{ ; } \bb^T = [-1,1]\]
\[\mathbf w'^T = [1,-1] \text{ ; } \bb'^T = [1,-1]\]
For each set of coefficients, compute the loss under MSE.
Convexity:
\[\mathcal L(\alpha\{\mathbf w, \bb \} + (1 - \alpha)\{\mathbf w', \bb' \} | \mathbf X , \mathbf y) \le \alpha \, \mathcal L(\{\mathbf w, \bb \} | \mathbf X , \mathbf y) + (1 - \alpha) \, \mathcal L(\{\mathbf w', \bb' \} | \mathbf X , \mathbf y)\]
Setting \(\alpha = .5\), we get:
\[\mathbf w^T = [0,0] \text{ ; } \bb^T = [0,0]\]
What is the loss at this point?
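We can check the three losses numerically. This is a direct sketch of the setup above, with all biases fixed at zero:

```python
import numpy as np

X = np.array([1.0, -1.0])   # one feature, two observations
y = np.array([1.0, -1.0])

relu = lambda z: np.maximum(z, 0.0)

def mse(w, beta):
    """MSE of the 2-hidden-unit ReLU network with all biases set to zero."""
    preds = np.array([beta @ relu(w * x) for x in X])
    return np.mean((y - preds) ** 2)

loss_a = mse(np.array([-1.0, 1.0]), np.array([-1.0, 1.0]))   # first parameter set
loss_b = mse(np.array([1.0, -1.0]), np.array([1.0, -1.0]))   # second parameter set
loss_mid = mse(np.array([0.0, 0.0]), np.array([0.0, 0.0]))   # alpha = .5 average

print(loss_a, loss_b, loss_mid)  # 0.0 0.0 1.0
```

Both parameter sets fit the data perfectly, yet their convex combination has loss 1, so the loss surface cannot be convex.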
In general, the NN loss function is not convex in the unknowns!
There are a lot of little local minima
Not all are created equal!
We have a flat minimum and a sharp minimum for the empirical loss function
We haven’t talked much about generalization for NNs (it’s coming), but this type of loss surface appears frequently for NNs and will have an effect on generalization
Which minimum do you think will generalize better?
SGD is only guaranteed to converge to one of these local minima!
But, SGD is also more likely to converge to the flat one!
Any intuition?
Think about the noisiness inherent in SGD.
SGD has also been shown to seek out flat minima that correspond to simpler models
A sort of funky result
Problem 2:
The gradient for a DNN can be really hard to work with
Lack of differentiability due to ReLU activations
Vanishing and exploding gradients due to activation function choice
Large hidden layers have very large Jacobians describing the first derivative with respect to the unknowns
For a generic DNN, the chain rule is still used, but it can get really complicated really quickly.
Fortunately, there is a famous algorithm that will allow us to train DNNs of arbitrary depth pretty efficiently: backpropagation
Next class, we’ll discuss backpropagation and then address the problems with gradients for DNNs as much as possible